Closed Bug 1728747 Opened 4 years ago Closed 2 years ago

Randomly high CPU, typing lag, tons of TCP send / receive, from/to localhost for hours in remote desktop environment

Categories

(Thunderbird :: Untriaged, defect)

Unspecified
Windows
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: duparchy, Unassigned)

References

()

Details

(Keywords: perf)

Attachments

(5 files)

Attached image TB-TCP.png

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

Nothing special

Actual results:

Thunderbird has been eating the CPU for hours now.
Monitoring the TB process with procmon (Windows), Thunderbird seems to be lost in a loop of TCP send / receive from/to localhost.

What's the point of those loopback transmissions?

In fact the real problem, maybe not related to those loopback connections, is that users are experiencing typing lag (again...).

Are these same users on RDS?

Flags: needinfo?(duparchy)
Keywords: perf

Hi, yes. That's my first attempt to understand the bug/feature before opening that case.

Flags: needinfo?(duparchy)

I manage 6 Windows 2019 RDSH servers and ~70 users.

This bug occurs randomly for different users on different servers.

I will have to revert to TB 60 (again...).

See Also: → 1668811

Hi,
As far as I can tell from the user's point of view, this is not related to bug 1668811.

Here we have, I think, a bug triggering a 20-year-old questionable design (loopback TCP connection).
This results in a kind of DoS attack.

Besides the bug, the inter-process loopback connection does not seem to be a true loopback to localhost.
Or is it just Process Monitor that translates "localhost" to the FQDN?

Because true loopback connections are supposed to be optimized for inter-process communication. See https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/hh997026(v=ws.11)

Summary: High CPU - tons of TCP send / receive, from/to localhost for hours. → High CPU - Typing lag - tons of TCP send / receive, from/to localhost for hours.

I've seen that TB 91 is now multi-process. Is there a chance that this inter-process communication through loopback (or pseudo-loopback) interface has been improved / re-designed ?

Attached image TB-TCP2.png

Uploaded another capture of that TCP loopback flooding. Mind the 62% of the overall server I/O events. Looks like a true denial of service.
(Note that I checked with netstat -ano. This is a true loopback (127.0.0.1). procmon.exe translates it to the FQDN.)
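
The netstat check described above can be scripted. The following is a minimal Python sketch, assuming the usual five-column TCP rows of `netstat -ano` on Windows (Proto, Local Address, Foreign Address, State, PID). The function name and the sample rows (ports, PIDs, addresses) are made up for illustration and are not taken from the affected servers.

```python
from collections import Counter


def loopback_tcp_by_pid(netstat_output: str) -> Counter:
    """Count TCP connections whose two endpoints are both 127.0.0.1, per PID."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected TCP row: Proto, Local Address, Foreign Address, State, PID.
        # Header lines, UDP rows and blank lines do not match and are skipped.
        if len(fields) != 5 or fields[0].upper() != "TCP":
            continue
        _proto, local, remote, _state, pid = fields
        if local.startswith("127.0.0.1:") and remote.startswith("127.0.0.1:"):
            counts[pid] += 1
    return counts


# Invented sample in the shape of `netstat -ano` output (hypothetical values).
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    127.0.0.1:50001        127.0.0.1:50002        ESTABLISHED     4321
  TCP    127.0.0.1:50002        127.0.0.1:50001        ESTABLISHED     4321
  TCP    10.0.0.5:50010         192.0.2.10:993         ESTABLISHED     4321
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       900
"""

print(loopback_tcp_by_pid(SAMPLE))
```

On a live server the text would come from running `netstat -ano` and passing its output to the function; the PID can then be matched against the Thunderbird process shown in Task Manager or procmon.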

Upgraded to TB 91.1.

This TCP loopback connection bug/feature persists.
On occasion a user's TB process will seemingly run amok.

This TCP loopback connection bug/feature hasn't shown up for days now. Maybe it's gone.

Monitoring TB activity with procmon I still see tons of registry queries, all the same in a row. There's perhaps room for improvement here.

What's the point of those dozens of RegQueryValue calls:

HKLM\SOFTWARE\Microsoft\Input\InputServiceEnabledForCCI

HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\OOBE\LaunchUserOOBE
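
To put a number on those repeated queries, a procmon capture can be saved as CSV and tallied. This is a rough Python sketch, assuming the export uses procmon's default column headers ("Process Name", "Operation", "Path"). The function name and the sample rows, including their Result values, are invented for illustration.

```python
import csv
import io
from collections import Counter


def top_registry_queries(csv_text: str, process_name: str, n: int = 5):
    """Return the n most frequent RegQueryValue paths for one process."""
    reader = csv.DictReader(io.StringIO(csv_text))
    paths = Counter(
        row["Path"]
        for row in reader
        if row["Process Name"].lower() == process_name.lower()
        and row["Operation"] == "RegQueryValue"
    )
    return paths.most_common(n)


# Invented sample rows in the assumed procmon CSV layout (raw string keeps backslashes).
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:00.0000001","thunderbird.exe","4321","RegQueryValue","HKLM\SOFTWARE\Microsoft\Input\InputServiceEnabledForCCI","SUCCESS",""
"10:00:00.0000002","thunderbird.exe","4321","RegQueryValue","HKLM\SOFTWARE\Microsoft\Input\InputServiceEnabledForCCI","SUCCESS",""
"10:00:00.0000003","thunderbird.exe","4321","RegQueryValue","HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\OOBE\LaunchUserOOBE","SUCCESS",""
"10:00:00.0000004","thunderbird.exe","4321","RegOpenKey","HKLM\SOFTWARE\Microsoft\Input","SUCCESS",""
"10:00:00.0000005","explorer.exe","1200","RegQueryValue","HKLM\SOFTWARE\Microsoft\Input\InputServiceEnabledForCCI","SUCCESS",""
'''

for path, count in top_registry_queries(SAMPLE, "thunderbird.exe"):
    print(count, path)
```

When reading a real export from disk, opening the file with `encoding="utf-8-sig"` avoids a stray byte-order mark ending up in the first column name.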

One occurrence today of those seemingly infinite TCP loopback connections.
So it's still there...

Again...

I checked the user's settings.

  • Two IMAP accounts. Both accounts are set to not synchronize locally
  • No add-ons
  • No Global indexer.

Is there something I can do to help resolve this bug ?
Logging ?

For one user where this problem occurs frequently I've created a profile from scratch.
Problem NOT fixed.
This seems to be worse... Instead of taking 6% (of 14 vCPUs, i.e. ~85% of one CPU), I see the TB process lost in loopback connections reaching 11%, plus a second process taking 5% (of 14 vCPUs).
This creates problems when "real time" networking is required. Other users on Zoom, Teams etc. are experiencing problems with audio/video.
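
For reference, the conversion behind "6% of 14 vCPUs is about 85% of one CPU" can be spelled out as a small Python sketch. It assumes the reported percentage is a share of total server CPU capacity; the function name is hypothetical.

```python
def percent_of_one_core(total_percent: float, vcpus: int) -> float:
    """Convert a share of total CPU capacity into a share of a single vCPU."""
    return total_percent * vcpus


# Figures quoted in this comment, on a 14-vCPU server.
for share in (6, 11, 5):
    print(f"{share}% of 14 vCPUs = {percent_of_one_core(share, 14):.0f}% of one vCPU")
```

Under that assumption, the process at 11% is using more than one and a half vCPUs' worth of time.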

up.

Is there something I can do to help resolve this bug ?
Logging ?

99.99% chance that this process, which permanently takes 6% of 14 vCPUs, is lost in a "localhost" loop.

These are TB processes for 9 different users up there.

Please try version 91 with Help > Troubleshoot mode

Flags: needinfo?(duparchy)

I already tested a newly created profile for a user (without extensions).
So unless there's something else that I can diagnose in troubleshooting mode, I don't think it's worth the trouble to disturb a user.
As I said, this problem, as harmless as it looks, is in fact a kind of denial of service and slows down the entire server.
This is not just me; this is randomly killing one server after another in an entire RDSH farm infrastructure with 8 servers and 85 users.

Anyway, Thunderbird also does way too much disk access for a cloud infrastructure using a shared iSCSI or FC storage array.
Unless steps are taken to improve that situation, we won't use it for long. This is sad.

Flags: needinfo?(duparchy)

TB 91.3. No improvement.

Two Thunderbird processes for two different users are accounting for 54% of all events on that server.

Attached image tcp-tcp4.png

No improvement w/ 91.3

(In reply to duparchy from comment #7)

I've seen that TB 91 is now multi-process. Is there a chance that this inter-process communication through loopback (or pseudo-loopback) interface has been improved / re-designed ?

Not AFAIK, which your testing confirms. No idea where this traffic is coming from. Maybe Magnus has an idea.

Anyway, Thunderbird also does way too much disk access for a cloud infrastructure using a shared iSCSI or FC storage array.

Yes, this has been true for many years. Debilitating in some cases. Is there any possibility to put Thunderbird data on the server's local disk, which should help? (for example disk local on the hypervisor)

I mention this because there will be no relief coming from Thunderbird until the buffering issues are fixed by benc's refactoring and Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of incorrect Close, unchecked Flush, Write etc. in nsPop3Sink.cpp and friends.

Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(duparchy)

Hi,

The idea behind a "cloud" infrastructure is that everything is backed up at the storage array level.
Plus, there will be some level of high availability (Live volume, etc.).
Not only would moving some data to local disks be cumbersome (editing everyone's Thunderbird to move her/his profile), but it would defeat the entire "cloud" idea.
In addition, that would be impossible when we move our private cloud to a cloud provider (AWS, Azure etc.).

For good or bad, we are living in the "cloud" days, at least for professionals.
Developers should stop thinking that everyone sits beside his or her own brick-and-mortar eight-core machine with 32 GB of RAM.

Thanks for trying to push the idea to whom it may concern.

And thanks for your support.

Flags: needinfo?(duparchy)

No idea what would cause it, but probably dupe of bug 1732926. Try bug 1732926 comment 15 and report back there.

Flags: needinfo?(mkmelin+mozilla)

Yes I could try to disable the multi-process. But given the fact that feature/bug was present before TB 91 multi-process, I doubt it will have any effect.

Just to clarify ... this issue doesn't exist for you in version 68?
The port numbers are 55238, 55239, 52682, 52689?

(In reply to duparchy from comment #27)

that feature/bug was present before TB 91 multi-process

True, but it will at least remove one variable from the diagnosis process. Lest we forget about it, I suggest that it stay disabled until all your problems are resolved.

Flags: needinfo?(duparchy)
OS: Unspecified → Windows
Summary: High CPU - Typing lag - tons of TCP send / receive, from/to localhost for hours. → Randomly high CPU, typing lag, tons of TCP send / receive, from/to localhost for hours in remote desktop environment

Maybe it simply went unnoticed in TB 60, but users didn't complain about performance and lag.

Right now three people in a row on the same server have the "send-receive gone crazy" problem.

Flags: needinfo?(duparchy)
Attached image tb.png

Still there. (Not checked w/ TB 100+ though)

Most of the time it goes unnoticed because we're on a 10Gb network / 16 CPUs.

That is, until high CPU/network load (several Zoom sessions in a row; we're talking about an RDSH server with many users) reveals the underlying problem.

Reporter, does this still fail for you when using version 102 or newer version?

Whiteboard: [closeme 2022-11-15]

Hi,
I rolled it out to our RDSH servers last week, so I can't tell for sure if the problem is gone.
TB 102 seems much more performant.
Definitely improved on I/O.

Though, I still see some dubious I/O through localhost TCP. But I've not seen any TB process going crazy so far.

Is there any information about underlying changes that would make us confident about the resolution of that problem?

Resolved per whiteboard

Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
Whiteboard: [closeme 2022-11-15]

Hi,

Here we go again.

Upgraded to 115.3.1
To help a user, I explained to him how to do a "repair folder"... and did it on my own Inbox (10K messages).

TB has now been eating my CPU on TCP sends/receives for 18 hours.
